Client Report - Late Flights & Missing Data (JSON)

Course DS 250

Author

Carson Aller

Show the code
import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)

# Load dataset
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")

Question 1 Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.

Show the code
cleaned_df = df.replace(["NaN"], np.nan).convert_dtypes()


row_with_nan = cleaned_df[cleaned_df.isna().any(axis=1)].iloc[0]

row_with_nan
airport_code                         IAD
airport_name                            
month                            January
year                                2005
num_of_flights_total               12381
num_of_delays_carrier                414
num_of_delays_late_aircraft         1058
num_of_delays_nas                    895
num_of_delays_security                 4
num_of_delays_weather                 61
num_of_delays_total                 2430
minutes_delayed_carrier             <NA>
minutes_delayed_late_aircraft      70919
minutes_delayed_nas                35660
minutes_delayed_security             208
minutes_delayed_weather             4497
minutes_delayed_total             134881
Name: 2, dtype: object

QUESTION 2 Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.

Atlanta Georgia has the worst airport according to the attributes provided. They also have a lot of total flights.

Show the code
summary = df.groupby('airport_name').agg(
    total_flights=('num_of_flights_total', 'sum'),
    delayed_flights=('num_of_delays_total', 'sum'),
    proportion_delayed=('num_of_delays_total', lambda x: x.sum() / df['num_of_flights_total'].sum()),
    avg_delay_hours=('minutes_delayed_total', lambda x: x.dropna().mean() / 60)
).reset_index()

summary
airport_name total_flights delayed_flights proportion_delayed avg_delay_hours
0 884879 172413 0.011236 3158.326786
1 Atlanta, GA: Hartsfield-Jackson Atlanta Intern... 4235114 870910 0.056754 6893.514683
2 Chicago, IL: Chicago O'Hare International 3400032 773122 0.050381 6955.351333
3 Denver, CO: Denver International 2323376 439964 0.028671 3232.303689
4 Salt Lake City, UT: Salt Lake City International 1293072 190733 0.012429 1292.472814
5 San Diego, CA: San Diego International 870161 167747 0.010931 1056.324267
6 San Francisco, CA: San Francisco International 1565257 408631 0.026629 3344.965354
7 Washington, DC: Washington Dulles International 773480 152630 0.009946 1284.092287

QUESTION 3 What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month.

To me it looks like November is the best month to fly. Has less then 10,000 delays.

Show the code
ggplot(df.dropna(subset=['month']), aes(x='month', y='num_of_delays_total')) + \
    geom_point() + \
    labs(title='Best Month to Fly', x='Month', y='Number of Delays')

According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table.

Use these three rules for your calculations:

100% of delayed flights in the Weather category are due to weather

30% of all delayed flights in the Late-Arriving category are due to weather

Question 4 From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%

Show the code
late_aircraft_mean = df['num_of_delays_late_aircraft'].mean()
df['num_of_delays_late_aircraft'] = df['num_of_delays_late_aircraft'].fillna(late_aircraft_mean)

def calculate_weather_delays(row):
    weather = row['num_of_delays_weather']  # 100% weather delays
    late_aircraft = row['num_of_delays_late_aircraft'] * 0.30
    nas_factor = 0.40 if row['month'] in ['April', 'May', 'June', 'July', 'August'] else 0.65
    nas = row['num_of_delays_nas'] * nas_factor
    return weather + late_aircraft + nas

df['total_weather_delays'] = df.apply(calculate_weather_delays, axis=1)

df[['airport_name', 'month', 'num_of_delays_weather', 'num_of_delays_late_aircraft', 'num_of_delays_nas', 'total_weather_delays']].head()
airport_name month num_of_delays_weather num_of_delays_late_aircraft num_of_delays_nas total_weather_delays
0 Atlanta, GA: Hartsfield-Jackson Atlanta Intern... January 448 -999 4598 3137.00
1 Denver, CO: Denver International January 233 928 935 1119.15
2 January 61 1058 895 960.15
3 Chicago, IL: Chicago O'Hare International January 306 2255 5415 4502.25
4 San Diego, CA: San Diego International January 56 680 638 674.70

QUESTION 5 Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.

Show the code
weather_summary = df.groupby('airport_name').agg(
    total_flights=('num_of_flights_total', 'sum'),
    total_weather_delays=('total_weather_delays', 'sum')
).reset_index()

weather_summary['proportion_delayed_by_weather'] = weather_summary['total_weather_delays'] / weather_summary['total_flights']

ggplot(weather_summary, aes(x='airport_name', y='proportion_delayed_by_weather')) + \
    geom_point(fill='skyblue') + \
    coord_flip() + \
    labs(title='Proportion of Flights Delayed by Weather by Airport', x='Airport', y='Proportion Delayed')